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Abstract 

The difficulties involved in spelling er- 
ror detection and correction in a lan- 
guage have been investigated in this work 
through the conceptualization of SpellNet 
- the weighted network of words, where 
edges indicate orthographic proximity be- 
tween two words. We construct SpellNets 
for three languages - Bengali, English and 
Hindi. Through appropriate mathemati- 
cal analysis and/or intuitive justification, 
we interpret the different topological met- 
rics of SpellNet from the perspective of 
the issues related to spell-checking. We 
make many interesting observations, the 
most significant among them being that 
the probability of making a real word error 
in a language is propotionate to the aver- 
age weighted degree of SpellNet, which is 
found to be highest for Hindi, followed by 
Bengali and English. 

1 Introduction 

Spell-checking is a well researched area in NLP, 
which deals with detection and automatic correc- 
tion of spelling errors in an electronic text docu- 
ment. Several approaches to spell-checking have 
been described in the literature that use statistical, 
rule-based, dictionary-based or hybrid techniques 
(see ( jKukich, 1992] ) for a dated but substantial sur- 
vey). Spelling errors are broadly classified as non- 
word errors (NWE) and real word errors (RWE). If 
the misspelt string is a valid word in the language, 
then it is called an RWE, else it is an NWE. For ex- 
ample, in English, the word "fun" might be misspelt 



as "gun" or "vun"; while the former is an RWE, the 
latter is a case of NWE. It is easy to detect an NWE, 
but correction process is non-trivial. RWE, on the 
other hand are extremely difficult to detect as it re- 
quires syntactic and semantic analysis of the text, 
though the difficulty of correction is comparable to 
that of NWE (see ( [Hirst and Budanitsky, 2005| ) and 
references therein). 

Given a lexicon of a particular language, how 
hard is it to develop a perfect spell-checker for that 
language? Since context-insensitive spell-checkers 
cannot detect RWE and neither they can effec- 
tively correct NWE, the difficulty in building a per- 
fect spell-checker, therefore, is reflected by quan- 
tities such as the probability of a misspelling be- 
ing RWE, probability of more than one word be- 
ing orthographically closer to an NWE, and so 
on. In this work, we make an attempt to under- 
stand and formalize some of these issues related to 
the challenges of spell-checking through a complex 
network approach (see dAlbert and Barabasi, 2002[ 
Newman, 2003 ) for a review of the field). This in 



turn allows us to provide language-specific quan- 
titative bounds on the performance level of spell- 
checkers. 

In order to formally represent the orthographic 
structure (spelling conventions) of a language, we 
conceptualize the lexicon as a weighted network, 
where the nodes represent the words and the weights 
of the edges indicate the orthoraphic similarity be- 
tween the pair of nodes (read words) they connect. 
We shall call this network the Spelling Network 
or SpellNet for short. We build the SpellNets for 
three languages - Bengali, English and Hindi, and 
carry out standard topological analysis of the net- 
works following complex network theory. Through 



appropriate mathematical analysis and/or intuitive 
justification, we interpret the different topological 
metrics of SpellNet from the perspective of diffi- 
culties related to spell-checking. Finally, we make 
several cross-linguistic observations, both invari- 
ances and variances, revealing quite a few inter- 
esting facts. For example, we see that among the 
three languages studied, the probability of RWE is 
highest in Hindi followed by Bengali and English. 
A similar observation has been previously reported 



in flBhatt et al, 2005) ) for RWEs in Bengali and En- 
glish. 

Apart from providing insight into spell-checking, 
the complex structure of SpellNet also reveals the 
self-organization and evolutionary dynamics under- 
lying the orthographic properties of natural lan- 
guages. In recent times, complex networks have 
been successfully employed to model and explain 
the structure and organization of several natural 
and social phenomena, such as the foodweb, pro- 
tien interaction, formation of language invento- 



ries (Choudhury etal., 2006), syntactic structure of 



languages ( |i Cancho and Sole, 20 04), WWW, social 
collaboration, scientific citations and many more 
(see ( [Albert and Barabasi, 20021 |Newman, 2003] ) 



and references therein). This work is inspired 
by the aforementioned models, and more specif- 
ically a couple of similar works on phonological 
neighbors' network of words ( Kapatsinski, 2006 



|Vitevitch, 2005[ ), which try to explain the human 
perceptual and cognitive processes in terms of the 
organization of the mental lexicon. 

The rest of the paper is organized as follows. Sec- 
tion |2] defines the structure and construction pro- 
cedure of SpellNet. Section [3] and [4] describes the 
degree and clustering related properties of Spell- 
Net and their significance in the context of spell- 
checking, respectively. Section [5] summarizes the 
findings and discusses possible directions for future 
work. The derivation of the probability of RWE in a 
language is presented in Appendix A. 

2 SpellNet: Definition and Construction 

In order to study and formalize the orthographic 
characteristics of a language, we model the lexicon 
A of the language as an undirected and fully con- 
nected weighted graph G(V, E). Each word w G A 




Figure 1 : The structure of SpellNet: (a) the weighted 
SpellNet for 6 English words, (b) Thresholded coun- 
terpart of (a), for 9 = 1 



is represented by a vertex v w €f V, and for every 
pair of vertices v w and v w ' in V, there is an edge 
(v w ,v w /) € E. The weight of the edge (v w , v w i), is 
equal to ed{w, w') - the orthographic edit distance 
between w and w' (considering substitution, dele- 
tion and insertion to have a cost of 1). Each node 
v w € V is also assigned a node weight Wv{v w ) 
equal to the unigram occurrence frequency of the 
word w. We shall refer to the graph G(V, E) as the 
SpellNet. Figure QJa) shows a hypothetical SpellNet 
for 6 common English words. 

We define unweighted versions of the graph 
G(V, E) through the concept of thresholding as 
described below. For a threshold 9, the graph 
Gq{V, E$) is an unweighted sub-graph of G(V, E), 
where an edge (v w , v w > ) G E is assigned a weight 1 
in Eg if and only if the weight of the edge is less than 
or equal to 6, else it is assigned a weight 0. In other 
words, Eg consists of only those edges in E whose 
edge weight is less than or equal to 9. Note that all 
the edges in Eq are unweighted. Figure (Ub) shows 
the thresholded SpellNet shown inQJa) for 9 = 1. 

2.1 Construction of SpellNets 

We construct the SpellNets for three languages - 
Bengali, English and Hindi. While the two Indian 
languages - Bengali and Hindi - use Brahmi derived 
scripts - Bengali and Devanagari respectively, En- 
glish uses the Roman script. Moreover, the orthog- 
raphy of the two Indian languages are highly phone- 
mic in nature, in contrast to the morpheme-based or- 
thography of English. Another point of disparity lies 
in the fact that while the English alphabet consists 



of 26 characters, the alphabet size of both Hindi and 
Bengali is around 50. 

The lexica for the three languages have 
been taken from public sources. For En- 
glish it has been obtained from the website 
www.audiencedialogue.org/susteng.html; for Hindi 
and Bengali, the word lists as well as the unigram 
frequencies have been estimated from the mono- 
lingual corpora published by Central Institute of 
Indian Languages. We chose to work with the most 
frequent 10000 words, as the medium size of the 
two Indian language corpora (around 3M words 
each) does not provide sufficient data for estimation 
of the unigram frequencies of a large number 
of words (say 50000). Therefore, all the results 
described in this work pertain to the SpellNets 
corresponding to the most frequent 10000 words. 
However, we believe that the trends observed do not 
reverse as we increase the size of the networks. 

In this paper, we focus on the networks at three 
different thresholds, that is for 9 = 1,3, 5, and study 
the properties of Gg for the three languages. We 
do not go for higher thresholds as the networks be- 
come completely connected at 9 = 5. Table Q] re- 
ports the values of different topological metrics of 
the SpellNets for the three languages at three thresh- 
olds. In the following two sections, we describe in 
detail some of the topological properties of Spell- 
Net, their implications to spell-checking, and obser- 
vations in the three languages. 

3 Degree Distribution 

The degree of a vertex in a network is the num- 
ber of edges incident on that vertex. Let be the 
probability that a randomly chosen vertex has de- 
gree k or more than k. A plot of P^ for any given 
network can be formed by making a histogram of 
the degrees of the vertices, and this plot is known 
as the cumulative degree distribution of the net- 
work ( |Newman, 2003 1 ). The (cumulative) degree 
distribution of a network provides important insights 
into the topological properties of the network. 

Figure |2] shows the plots for the cumulative de- 
gree distribution for 9 = 1,3,5, plotted on a log- 
linear scale. The linear nature of the curves in 
the semi-logarithmic scale indicates that the dis- 
tribution is exponential in nature. The exponen- 



tial behaviour is clearly visible for 9 = 1, how- 
ever at higher thresholds, there are very few nodes 
in the network with low degrees, and therefore 
only the tail of the curve shows a pure exponen- 
tial behavior. We also observe that the steepness 
(i.e. slope) of the log(Pfe) with respect to k in- 
creases with 9. It is interesting to note that al- 
though most of the naturally and socially occurring 
networks exhibit a power-law degree distribution 
(see d Albert and Barabasi, 2002] |Newman, 2003 



i Cancho and Sole, 20041 |Choudhury et al, 2006| ) 
and references therein), SpellNets feature exponen- 
tial degree distribution. Nevertheless, similar results 
have also been reported for the phonological neigh- 



bors' network ( [Kapatsinski, 2006] ) . 



3.1 Average Degree 

Let the degree of the node v be denoted by k{v). We 
define the quantities - the average degree (k) and the 
weighted average degree (k w t) for a given network 
as follows (we drop the subscript w for clarity of 
notation). 

(i) 



{k wt ) 



E v evHv)Wy(v) 
T, ve vW v (v) 



(2) 



where N is the number of nodes in the network. 

Implication: The average weighted degree of 
SpellNet can be interpreted as the probability of 
RWE in a language. This correlation can be derived 
as follows. Given a lexicon A of a language, it can 
be shown that the probability of RWE in a language, 
denoted by p rwe (A) is given by the following equa- 
tion (see Appendix A for the derivation) 

*W(A) = E E P ed{w ' w ' ] P{w) (3) 
weA w'eA 

Let neighbor(w, d) be the number of words in A 
whose edit distance from w is d. Eqn[3]can be rewrit- 
ten in terms of neighbor(w, d) as follows. 

oo 

Prwe(A) = X! neighbor (w,d)p(w) (4) 

w&A d=l 

Practically, we can always assume that d is 
bounded by a small positive integer. In other 
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Table 1: Various topological metrics and their associated values for the SpellNets of the three languages 
at thresholds 1, 3 and 5. Metrics: M - number of edges; (k) - average degree; (k wt ) - average weighted 
degree; (CC) - average clustering coefficient; (CC w t) - average weighted clustering coefficient; r^d - 
Pearson correlation coefficient between degrees of neighbors; (/) - average shortest path; D - diameter. 
N.E - Not Estimated. See the text for further details on definition, computation and significance of the 
metrics. 
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Figure 2: Cumulative degree distribution of SpellNets at different thresholds presented in semi-logarithmic 
scale. 



words, the number of errors simultaneously made 
on a word is always small (usually assumed to 
be 1 or a slowly growing function of the word 
length (Kukich 7l992[ )). Let us denote this bound by 
9. Therefore, 

9 

p rwe (h) « ^2 ^2p d neighbor (w,d)p(w) (5) 

uiSA d=l 

Since p < 1, we can substitute p d by p to get an 
upper bound on p rwe (A), which gives 

8 

Prwe(k) < P^2Y1 neighbor(w, d)p(w) (6) 

ujSA d=l 

The term J2d=i neighbor(w, d) computes the 
number of words in the lexicon, whose edit distance 
from w is atmost 6. This is nothing but k(v w ), i.e. 



the degree of the node v w , in Gq. Moreover, the term 
p(w) is proportionate to the node weight Wv(v w ). 
Thus, rewriting Eqn [6] in terms of the network pa- 
rameters for Gq, we get (subscript w is dropped for 
clarity) 



;w(A) < p 



(7) 



EvevWv(v) 

Comparing Eqn [2] with the above equation, we can 
directly obtain the relation 



Prwe 

(A) < Ci(k wt ) 



(8) 



where C\ is some constant of proportionality. Note 
that for 6 = 1, p rwe (A) oc (k wt ). If we ignore 
the distribution of the words, that is if we assume 
p(w) = l/N, fhenp r ™ e (A) a (k). 

Thus, the quantity (k wt ) provides a good estimate 
of the probability of RWE in a language. 
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Figure 3: Scatter-plots for degree versus unigram 
frequency at different 9 for Hindi 

Observations and Inference: At 9 = 1, the av- 
erage weighted degrees for Hindi, Bengali and En- 
glish are 13.81, 7.73 and 6.61 respectively. Thus, the 
probability of RWE in Hindi is significantly higher 
than that of Bengali, which in turn is higher than 



that of English ( |Bhatt et al, 2005| ). Similar trends 
are observed at all the thresholds for both (k wt ) and 
(k). This is also evident from Figures 12 which show 
the distribution of Hindi to lie above that of Bengali, 
which lies above English (for all thresholds). 

The average degree (k) is substantially smaller 
(0.5 to 0.33 times) than the average weighted de- 
gree (k w t) for all the 9 SpellNets. This suggests 
that the higher degree nodes in SpellNet have higher 
node weight (i.e. occurrence frequency). Indeed, as 
shown in Figure [3] for Hindi, the high unigram fre- 
quency of a node implies higher degree, though the 
reverse is not true. The scatter-plots for the other 
languages are similar in nature. 

3.2 Correlation between Degrees of Neighbors 

The relation between the degrees of adjacent words 
is described by the degree assortativity coefficient. 
One way to define the assortativity of a network is 
through the Pearson correlation coefficient between 
the degrees of the two vertices connected by an edge. 
Each edge (u, v) in the network adds a data item 
corresponding to the degrees of u and v to two data 
sets x and y respectively. The Pearson correlation 
coefficient for the data sets x and y of n items each 
is then defined as 

= nj^xy -J2xJ2v 

" v>£^-G>)^][n£^-(£^] 



Observation: r is positive for the networks in 
which words tend to associate with other words of 
similar degree (i.e. high degree with high degree 
and vice versa), and it is negative for networks in 
which words associate with words having degrees 
in the opposite spectrum. Refering to table [T] we 
see that the correlation coefficient rdd is roughly the 
same and equal to around 0.7 for all languages at 
9 = 1. As 9 increases, the correlation decreases as 
expected, due to the addition of edges between dis- 
similar words. 

Implication: The high positive conelation coeffi- 
cients suggest that SpellNets feature assortative mix- 
ing of nodes in terms of degrees. If there is an RWE 
corresponding to a high degree node v w , then due 
to the assortative mixing of nodes, the misspelling 
w' obtained from w, is also expected to have a high 
degree. Since w' has a high degree, even after detec- 
tion of the fact that w' is a misspelling, choosing the 
right suggestion (i.e. w) is extremely difficult un- 
less the linguistic context of the word is taken into 
account. Thus, more often than not it is difficult to 
correct an RWE, even after successful detection. 

4 Clustering and Small World Properties 

In the previous section, we looked at some of the de- 
gree based features of SpellNets. These features pro- 
vide us insights regarding the probability of RWE in 
a language and the level of difficulty in correcting 
the same. In this section, we discuss some of the 
other characteristics of SpellNets that are useful in 
predicting the difficulty of non-word error correc- 
tion. 

4.1 Clustering Coefficient 

Recall that in the presence of a complete list of valid 
words in a language, detection of NWE is a trivial 
task. However, correction of NWE is far from triv- 
ial. Spell-checkers usually generate a suggestion list 
of possible candidate words that are within a small 
edit distance of the misspelling. Thus, correction be- 
comes hard as the number of words within a given 
edit distance from the misspelling increases. Sup- 
pose that a word w € A is transformed into w' due 
to some typing error, such that w' ^ A. Also assume 
that ed(w, w') < 9. We want to estimate the number 
of words in A that are within an edit distance 9 of 



w' . In other words we are interested in finding out 
the degree of the node v w i in Gq, but since there is 
no such node in SpellNet, we cannot compute this 
quantity directly. Nevertheless, we can provide an 
approximate estimate of the same as follows. 

Let us conceive of a hypothetical node v w i. By 
definition of SpellNet, there should be an edge con- 
necting v w i and v w in Gq. A crude estimate of 
k(v w i) can be (k w t) of Gq. Due to the assortative 
nature of the network, we expect to see a high corre- 
lation between the values of k(v w ) and k(v w >), and 
therefore, a slightly better estimate of k(v w t) could 
be k(v w ). However, as v w > is not a part of the net- 
work, it's behavior in SpellNet may not resemble 
that of a real node, and such estimates can be grossly 
erroneous. 

One way to circumvent this problem is to look 
at the local neighborhood of the node v w . Let us 
ask the question - what is the probability that two 
randomly chosen neighbors of v w in Gq are con- 
nected to each other? If this probability is high, then 
we can expect the local neighborhood of v w to be 
dense in the sense that almost all the neighbors of 
v w are connected to each other forming a clique-like 
local structure. Since v w i is a neighbor of v w , it is 
a part of this dense cluster, and therefore, its degree 
k(v w t) is of the order of k(v w ). On the other hand, 
if this probability is low, then even if k(v w ) is high, 
the space around v w is sparse, and the local neigh- 
borhood is star-like. In such a situation, we expect 
k(v w i) to be low. 

The topological property that measures the prob- 
ability of the neighbors of a node being connected 
is called the clustering coefficient (CC). One of the 
ways to define the clustering coefficient C(v ) for a 
vertex v in a network is 



C(v) 



number of triangles connected to vertex v 
number of triplets centered on v 



For vertices with degree or 1, we put C{v) = 0. 
Then the clustering coefficient for the whole net- 
work {CC) is the mean CC of the nodes in the net- 
work. A corresponding weighted version of the CC 
(CC w t) can be defined by taking the node weights 
into account. 

Implication: The higher the value of 
k(v w )C(v w ) for a node, the higher is the probability 
that an NWE made while typing w is hard to correct 



due to the presence of a large number of ortho- 
graphic neighbors of the misspelling. Therefore, 
in a way (CC w t) reflects the level of difficulty in 
correcting NWE for the language in general. 

Observation and Inference: At threshold 1, 
the values of (CC) as well as {CC w t} is higher 
for Hindi (0.172 and 0.341 respectively) and Ben- 
gali (0.131 and 0.229 respectively) than that of En- 
glish (0.101 and 0.221 respectively), though for 
higher thresholds, the difference between the CC 
for the languages reduces. This observation further 
strengthens our claim that the level of difficulty in 
spelling error detection and correction are language 
dependent, and for the three languages studied, it is 
hardest for Hindi, followed by Bengali and English. 

4.2 Small World Property 

As an aside, it is interesting to see whether the Spell- 
Nets exhibit the so called small world effect that is 
prevalent in many social and natural systems (see 
dAlbert and Barabasi, 2002 



Newman, 2003 I for 



definition and examles). A network is said to be a 
small world if it has a high clustering coefficient and 
if the average shortest path between any two nodes 
of the network is small. 

Observation: We observe that SpellNets indeed 
feature a high CC that grows with the threshold. The 
average shortest path, denoted by (l) in Table [Q for 
9 = 1 is around 7 for all the languages, and reduces 
to around 3 for 9 = 3; at 9 = 5 the networks are 
near-cliques. Thus, SpellNet is a small world net- 
work. 

Implication: By the application of triangle in- 
equality of edit distance, it can be easily shown that 
(I) x 9 provides an upper bound on the average edit 
distance between all pairs of the words in the lexi- 
con. Thus, a small world network, which implies a 
small (I), in turn implies that as we increase the error 
bound (i.e. 9), the number of edges increases sharply 
in the network and soon the network becomes fully 
connected. Therefore, it becomes increasingly more 
difficult to correct or detect the errors, as any word 
can be a possible suggestion for any misspelling. In 
fact this is independently observed through the ex- 
ponential rise in M - the number of edges, and fall 
in (I) as we increase 9. 

Inference: It is impossible to correct very noisy 
texts, where the nature of the noise is random and 



words are distorted by a large edit distance (say 3 or 
more). 

5 Conclusion 

In this work, we have proposed the network of ortho- 
graphic neighbors of words or the SpellNet and stud- 
ied the structure of the same across three languages. 
We have also made an attempt to relate some of the 
topological properties of SpellNet to spelling error 
distribution and hardness of spell-checking in a lan- 
guage. The important observations of this study are 
summarized below. 

• The probability of RWE in a language can 
be equated to the average weighted degree of 
SpellNet. This probablity is highest in Hindi 
followed by Bengali and English. 

• In all the languages, the words that are more 
prone to undergo an RWE are more likely to be 
misspelt. Effectively, this makes RWE correc- 
tion very hard. 

• The hardness of NWE correction correlates 
with the weighted clustering coefficient of the 
network. This is highest for Hindi, followed by 
Bengali and English. 

• The basic topology of SpellNet seems to be an 
invariant across languages. For example, all 
the networks feature exponential degree distri- 
bution, high clustering, assortative mixing with 
respect to degree and node weight, small world 
effect and positive correlation between degree 
and node weight, and CC and degree. However, 
the networks vary to a large extent in terms of 
the actual values of some of these metrics. 

Arguably, the language-invariant properties of 
SpellNet can be attributed to the organization of 



the human mental lexicon (see ( Kapatsinski, 2006) 
and references therein), self-organization of ortho- 
graphic systems and certain properties of edit dis- 
tance measure. The differences across the lan- 
guages, perhaps, are an outcome of the specific or- 
thographic features, such as the size of the alphabet. 
Another interesting observation is that the phonemic 
nature of the orthography strongly correlates with 
the difficulty of spell-checking. Among the three 



languages, Hindi has the most phonemic and En- 
glish the least phonemic orthography. This corre- 
lation calls for further investigation. 

Throughout the present discussion, we have fo- 
cussed on spell-checkers that ignore the context; 
consequently, many of the aforementioned results, 
especially those involving spelling correction, are 
valid only for context-insensitive spell-checkers. 
Nevertheless, many of the practically useful spell- 
checkers incorporate context information and the 
current analysis on SpellNet can be extended for 
such spell-checkers by conceptualizing a network 
of words that capture the word co-occurrence pat- 
terns (Biemann, 2006). The word co-occurrence 
network can be superimposed on SpellNet and the 
properties of the resulting structure can be appro- 
priately analyzed to obtain similar bounds on hard- 
ness of context-sensitive spell-checkers. We deem 
this to be a part of our future work. Another way 
to improve the study could be to incorporate a more 
realistic measure for the orthographic similarity be- 
tween the words. Nevertheless, such a modification 
will have no effect on the analysis technique, though 
the results of the analysis may be different from the 
ones reported here. 

Appendix A: Derivation of the Probability 
of RWE 

We take a noisy channel approach, which 
is a common technique in NLP (for exam- 



ple ( |Brown et al., 1993] )), including spellcheck- 
ing (Kernighan et al., 1990). Depending on the 
situation, the channel may model typing or OCR 
errors. Suppose that a word w, while passing 
through the channel, gets transformed to a word 
w'. Therefore, the aim of spelling correction is to 
find the w* G A (the lexicon), which maximizes 
p(w*\w'), that is 



argmax p(w\w') = argmax p(w'\w)p{w) 

weA w£A 

(9) 

The likelihood p(w'\w) models the noisy 
channel, whereas the term p(w) is tradi- 
tionally referred to as the language model 
(see (Juraf sky and Martin, 2000) for an intro- 
duction). In this equation, as well as throughout this 



discussion, we shall assume a unigram language 
model, where p(w) is the normalized frequency of 
occurrence of w in a standard corpus. 

We define the probability of RWE for a word w, 

Prwe(w), as follows 



Pr 



E 

w'eA 



p(w'\w) 



(10) 



Stated differently, p rwe (w) is a measure of the prob- 
ability that while passing through the channel, w 
gets transformed into a form w' , such that w' G A 
and w' 7^ w. The probability of RWE in the lan- 
guage, denoted by p rwe (A), can then be defined in 
terms of the probability p TW e{w) as follows. 

p rwe (A) = Prwe(w)p(w) (11) 

= E E P{w'\w)p(w) 
weA w'eA 

In order to obtain an estimate of the likelihood 
p{w'\w), we use the concept of edit distance (also 
known as Levenstein distance (Levenst ein, 1965| >). 
We shall denote the edit distance between two words 
w and w' by ed(w, w'). If we assume that the proba- 
bility of a single error (i.e. a character deletion, sub- 
stitution or insertion) is p and errors are independent 
of each other, then we can approximate the likeli- 
hood estimate as follows. 



p(w'\w) 



ed(w,w') 



(12) 



Exponentiation of edit distance is a common mea- 
sure of word similarity or likelihood (see for exam- 
ple ( |Bailey and Hahn, 2001] )). 

Substituting for p(w'\w) in EqnfTTl we get 



.(A) = E E /' 

weA w'eA 

w^w' 



ed(w,w') 



p(w) (13) 
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